the appropriate item identification numbers. If a user knows only the
characteristics of the items he wants, he must submit a coded profile
outlining the search restrictions that should and/or should not be
met by the retrieved items. A thesaurus lists the coded variables and
question (single or multiple answer), taxonomic level (factual,
comprehension, or problem solving), difficulty level, last year
question was used, etc. The profiling or structuring procedure for
search requests is detailed. Results from the use of random and
sequential versions of the system are presented in order to document
a comparison of the two methods. An expanded presentation of this
system appears in the author's unpublished master's thesis. (JY)

ABSTRACT



The purpose of this documentation is to describe
the design of, and the modifications made to, the
Medsirch retrieval system; this description includes
profiling examples to illustrate the retrieval potential
of this system.

Results from the use of random and sequential direct
access files are also reported for purposes of comparing
the desirability and feasibility of implementing such
files.



Medsirch is an acronym for medical search, a program
designed to retrieve medical multiple choice
questions. The original documentation of this system
appeared in a master's thesis by this author (1969).
If the reader has more extensive interest in the data
management organization, record preparation and
storage, updating facilities, and supporting programs,
he should refer to the cited thesis.

CHAPTER ONE



INTRODUCTION 

The use of information storage and retrieval systems is a matter
of everyday experience for literate people. The public library,
correspondence files, accounting systems, directories, dictionaries, and
so on, are all information systems. All are composed of records to which
one may address a variety of allowable questions with a reasonable
expectation of retrieving a selection of records in response to each
question.

Medsirch is a machine system for the storage and retrieval of multiple 
choice items. At present it is being used by the R. S. McLaughlin 
Examination and Research Centre; it is hoped, however, that some of the 
design features in this system will provide a basis on which other examining 
bodies can receive similar services for the retrieval of large masses of 
data. 

The particular advantage of using a machine, that is, a computer,
for retrieval is pointed out by Baruch (1966, p. 27). He feels that the
computer's greatest assistance is in doing tasks such as sorting, filing,
indexing, searching, and particularly, being alert for low probability
occurrences. Indeed it is this kind of "light thinking" that computers do
especially well and that intelligent people seldom do correctly.

In any retrieval system, machine based or otherwise, records are
created and organized before the specific questions a system is to answer
have been stated (that is, the system is created in anticipation of needs
that are not fully known). Lipetz (1966, p. 178) points out that it would
be impossible to design a retrieval system that could respond to all
possible questions and prohibitively expensive to try to approximate such
a condition. The type of questions which the Medsirch system was designed
to handle is explained in chapter three of this report. Chapter four
examines the limitations that Medsirch imposes on the user's questions,
chapter five specifies the cost of asking them, and Appendices A and B
provide the thesaurus and profiling procedures for submitting these
questions.

However, it is not only the type of questions which will be addressed
to a system which influences its design. Consideration must also be given
to the characteristics of the medium in which the records are to be stored
and retrieved. This author (1969, p. 90) has already indicated that one
of the limitations in the retrieval field is the lack of comparisons being
made between different types of file organizations. The literature is not
lacking in suggesting hypothetical designs; however, this source gives
little or no concrete evidence as to which file organization is most
useful, efficient, and/or economical. In order to provide more information
to the reader regarding the differences between sequentially and randomly
accessed files, chapter two will discuss the merits and demerits of these
models as related to the Medsirch system. While it is true that the
discussion is in terms of searching multiple choice items, some of the
features specified are applicable to any data base.

CHAPTER TWO

COMPARISON OF MEDSIRCH FILE ORGANIZATIONS



Sequential Organization 

Also referred to as Direct File Organization, this method retrieves
items by a sequential scan of the complete file. Salton (1968, p. 244)
indicates that such a file is suitable if information is to be retrievable
according to a variety of different keys, since it is not usually possible
to store many copies of the same file to account for the various desired
file orders. The response time for sequential file searches is not optimal,
however, since a complete file scan is generally needed before any
information can be retrieved. Updating files with this type of organization
is also disadvantageous since rewriting sequential files is usually done by
copying records from one data set to another as needed. This is expensive
and would only be done when a number of records have to be altered.



Random Organization



In such a file records are stored and referenced on the basis of the
relationship between the key of a record and the direct address of the
location where the record is stored. This address is used when a record is
stored and again when it is to be retrieved. There are three methods
generally used for accessing records - direct address, dictionary look up,
and calculation - the first of which was used in the Medsirch System. Direct
address is used if the programmer, knowing the precise size and number of
records in his data, is able to supply the direct address at storage time.

Bleier and Vorhaus (1968) found some advantages in the use of random
access: (a) queries were retrieved rapidly since only relevant records were
searched, and (b) the size of data base had little effect on the speed of
retrieval. However, they also indicated the disadvantages: (a) increased
storage requirements to handle the list of addresses in core, and (b) a
significant increase in the complexity of maintaining the system. Dodd
(1969) also pointed out an additional shortcoming of random access files.

     Although random organization does allow for rapid
     access of a particular record with a known key, it
     is not suited for rapidly accessing a number of
     records. This limitation is imposed by time taken
     by the hardware access mechanism to locate a record
     [p. 122].

Dodd (1969) as well as IBM (1966) point out that records must be fixed
length if stored in random access; any data base with variable length
records must be either manipulated to form fixed length records or be
stored inefficiently as fixed length records of maximal size. Finally, IBM
(1967, pp. 72-73) points out that before a random direct access data set
can be used the machine must locate, format, and write a skeleton record
for each record in the information bank. Senko (1969, p. 121) states that
this loading of a random file takes 20 to 100 times longer than the
corresponding loading done sequentially. Since this is very slow, random
access data sets are usually created and then preserved for the life of
the file.

Medsirch Results 

In general the Medsirch system supports the literature in the
comparative use of sequential and random files. It has been found, for
instance, that updating sequentially is only justified when there is a large
number of records to be inserted, deleted, and/or modified (cf. p. 3). It
has also been found that to skeletalize a random file took approximately
8 minutes for 10,000 records, a time which ruled out creating a temporary
random file for each batch of retrievals. (The reader should note that this
time taken to set up the random access file should be spread over the life
of the file.) The relative cost of setting up skeleton records is inversely
related to the number of requests made to the bank between updates. If the
bank is moderately active, requiring regular updates, this installation
cost reduces the efficiency of random access noticeably. In the Medsirch
system at the present time there is almost a one to one relationship
between the number of requests and the number of updates. As such, random
access installation costs are enormous, relatively speaking. On the other
hand, 10,000 records are transferred from tape to sequential disc in
approximately 0.21 minutes (that is, in support of Senko (cf. p. 4),
Medsirch found sequential loading to be approximately 40 times faster than
random loading). Sequential loading time is so slight that a temporary
data set can be created for each batch of search requests, which
eliminates the cost of permanent disc storage.
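
The inverse relationship between loading cost and request volume can be
illustrated with a minimal sketch. It assumes the loading times cited above
(8 minutes random, 0.21 minutes sequential, for 10,000 records) and an even
spread of one loading cost over the requests made between updates. The
function name is hypothetical and is not part of the Medsirch programs.

```python
# Loading times cited in the text for a bank of 10,000 records (minutes).
RANDOM_LOAD_MIN = 8.0
SEQUENTIAL_LOAD_MIN = 0.21


def load_cost_per_request(load_minutes: float, requests_between_updates: int) -> float:
    """Spread one loading cost evenly over the requests served before the
    next update (an illustrative assumption, not a Medsirch routine)."""
    if requests_between_updates < 1:
        raise ValueError("at least one request is needed")
    return load_minutes / requests_between_updates


if __name__ == "__main__":
    for requests in (1, 10, 100):
        cost = load_cost_per_request(RANDOM_LOAD_MIN, requests)
        print(f"{requests:>3} requests per update: {cost:.2f} min of random loading each")
    ratio = RANDOM_LOAD_MIN / SEQUENTIAL_LOAD_MIN
    print(f"random/sequential loading ratio: {ratio:.1f}")
```

With one request per update, as reported for Medsirch, the full 8 minutes
falls on every request; the share shrinks only as requests per update grow.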

Since random files require fixed length records (cf. p. 4), and the
data base of the Medsirch system was variable in length, the author chose
to make the two compatible by programming. This required little effort and
did not in any way detract from or add to the feasibility of random access
in the Medsirch system.

However, in random files the job control language (JCL) for the IBM
360/67 does not handle blocksizes longer than the logical record length
(IBM, 1967, p. 56). It is here that sequential files show a distinct
advantage, since JCL will accommodate a blocksize of 7294 bytes on
sequential disc. Using IBM's (1967) figures it is possible to show what
this advantage is. There is an average access time of 75 milliseconds, an
average rotational delay of 12.5 milliseconds, and a transmission time of
0.26 milliseconds (for a total of 87.76 milliseconds) per 80-byte record.
Thus to access 91 80-byte records one at a time would take approximately
8 seconds (91 x 87.76 milliseconds). If these 80-byte records were blocked
with 91 records per block it would take only 1/10 of a second to access
them, or 80 times as fast. Thus sequential files which are blocked in this
way can access 91 sequential records 80 times faster than an unblocked
random file accessing those same 91 records one at a time. If a search is
made for only 10 records within a bank of 10,000 records, random access
would take 8/10 of a second (10 x 87.76 milliseconds); a sequential search
would take 11 seconds (0.1 x (10,000 ÷ 91)) to access the same 10 records.
However, to access 200 records (2% of the bank) random access would take
over 17 seconds (200 x 87.76 milliseconds) and sequential access would
still be 11 seconds. That is to say, the number of records being accessed
has a negligible effect on time in sequential files, since the entire file
must be searched for each request; this does not apply to random access,
since only relevant records are accessed. The reader should note that these
figures reflect the differences between input/output (I/O) times for
sequential and random access.
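
The arithmetic in the preceding paragraph can be reproduced with a short
sketch. The constants are the figures quoted from IBM (1967) in the text;
the function names are illustrative, and the 0.1 second per block is the
rounded value the text itself uses.

```python
# Per-access figures cited in the text (milliseconds).
ACCESS_MS = 75.0         # average access time
ROTATION_MS = 12.5       # average rotational delay
TRANSFER_MS = 0.26       # transmission time for one 80-byte record
RECORDS_PER_BLOCK = 91   # maximum blocking factor for 80-byte records


def random_io_seconds(records_wanted: int) -> float:
    """Unblocked random access: every record pays the full access cost."""
    per_record_ms = ACCESS_MS + ROTATION_MS + TRANSFER_MS
    return records_wanted * per_record_ms / 1000.0


def sequential_io_seconds(bank_size: int, block_seconds: float = 0.1) -> float:
    """Blocked sequential scan: the whole bank is read no matter how many
    records are wanted; one block read is rounded to 0.1 s as in the text."""
    return block_seconds * bank_size / RECORDS_PER_BLOCK


if __name__ == "__main__":
    bank = 10_000
    for wanted in (10, 200):
        print(
            f"{wanted:>3} records wanted: "
            f"random {random_io_seconds(wanted):.2f} s, "
            f"sequential {sequential_io_seconds(bank):.2f} s"
        )
```

The sequential figure is the same for both requests because it depends only
on the size of the bank, which is the point the paragraph makes.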
Dodd (cf. p. 4) mentioned that random access was unsuited for accessing
"a number of records". More specifically, if the sequential file has block
sizes 91 times as great as the block sizes of random files, the execution
time for I/O will be less in sequential files if one is accessing more than
1 1/2% of the records in a bank of 10,000 80-byte records. Since Bleier
and Vorhaus (cf. p. 4) found random access almost invariant to the size of
the pool, one cannot make a generalized statement regarding this 1 1/2%
trade off between random and sequential files. It works out, in fact, that
if the pool had 100,000 records, one would have to access more than 12 1/2% of
the pool before sequential I/O time would be less than random access I/O
time. The following algorithm can be used by the reader to estimate the
trade off for his data base of 80-byte records.

$$T = 0.1\,(N \div 91)$$

where N  = number of records in total bank,
      91 = maximum blocking factor for 80-byte records, and
      T  = I/O execution time (seconds) for sequential search of bank.

$$R = \frac{T}{0.08776}$$

where R is the number of records retrieved at even trade-off between
random and sequential I/O time.

To convert R to a percentage:

$$P = \frac{R}{N} \times 100$$

Thus if one is retrieving less than P% of the bank, random access I/O
time will be less.
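
A minimal sketch of the algorithm above, assuming the constants given in
the text (0.1 second per block of 91 records, 0.08776 second per unblocked
random access). The function name `break_even` is illustrative. Taken
literally, the formula gives a break-even share of roughly 1.25% for any N,
because N cancels when R is divided by N; the percentages quoted in the
surrounding discussion may therefore rest on rounding or on assumptions not
captured in the formula.

```python
BLOCK_SECONDS = 0.1        # time to read one block of 91 records
RECORDS_PER_BLOCK = 91     # maximum blocking factor for 80-byte records
RANDOM_SECONDS = 0.08776   # time for one unblocked random access


def break_even(bank_size: int) -> tuple[float, float, float]:
    """Return (T, R, P) as defined in the text for a bank of N records."""
    t = BLOCK_SECONDS * bank_size / RECORDS_PER_BLOCK  # sequential I/O time
    r = t / RANDOM_SECONDS                             # records at even trade-off
    p = r / bank_size * 100                            # R as a percentage of N
    return t, r, p


if __name__ == "__main__":
    for n in (10_000, 100_000):
        t, r, p = break_even(n)
        print(f"N={n}: T={t:.2f} s, R={r:.0f} records, P={p:.2f}%")
```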

The above algorithm does not reflect the trade-off in terms of total
execution time unless the amount of calculations done independent of I/O
remains constant in both random and sequential programs. Table 1 shows
that in the Medsirch system the amount of calculations independent of I/O
is greater for sequential searches; this is mainly due to the following
reasons. (1) Medsirch - 3 (sequential search) was developed and modified
over a period of two years, with each additional feature being added
separately and on the basis of programming simplicity, not on the basis
of execution efficiency. Medsirch - 4 (random search) was developed after
Medsirch - 3, with all features being incorporated simultaneously; as such
Medsirch - 4 was written in a more efficient manner. (If the reader has
programming experience he will appreciate the difference between these two
situational requirements.) Until Medsirch - 3 is completely rewritten for
maximal efficiency, the calculation time estimated for Medsirch - 3 should
be regarded as an upper limit. (2) The nature of the data base in the
Medsirch system necessitates more calculations when sequentially searched.
If only one record (or a fixed number of records) were selected per
retrieval, this additional calculation would not be necessary. Multiple
choice questions, however, vary in length from five to 100 records, and
thus a check must be made on each record to determine if it is the last
record for a particular multiple choice question.

TABLE 1

COMPARATIVE EXECUTION TIMES
(in average minutes per multiple choice item)

                                               Average Search Times
File           Implemented     Loading    -----------------------------
Organization   Program         Time*      I/O      Calculation   Total

Sequential     Medsirch - 3    0.21       0.0017   0.030         0.0317
Random         Medsirch - 4    8.00       0.008    0.0085        0.0165

* This time must be included in comparing sequential and random execution
times. Loading time has been kept separate from the "Average Search Times"
in this table since the relative cost of loading time in random files is
inversely related to the number of requests made to the bank between
updates (cf. p. 5). All cited figures are based on the use of the IBM
360/67 computer.
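
Table 1 keeps loading time apart from search time; the sketch below puts
the two together. It is only an illustration under stated assumptions: one
load serves a given number of items, and the per-item averages in Table 1
hold at any volume. The dictionary keys and function names are hypothetical.

```python
# Figures from Table 1 (minutes); loading is for the 10,000-record bank.
SEQUENTIAL = {"loading": 0.21, "search_per_item": 0.0317}
RANDOM = {"loading": 8.00, "search_per_item": 0.0165}


def total_minutes(org: dict[str, float], items: int) -> float:
    """One load plus the average total search time for `items` items."""
    return org["loading"] + org["search_per_item"] * items


def break_even_items() -> float:
    """Item count at which both organizations cost the same total time."""
    load_gap = RANDOM["loading"] - SEQUENTIAL["loading"]
    search_gap = SEQUENTIAL["search_per_item"] - RANDOM["search_per_item"]
    return load_gap / search_gap


if __name__ == "__main__":
    for items in (100, 1000):
        print(
            f"{items:>4} items per load: "
            f"sequential {total_minutes(SEQUENTIAL, items):.2f} min, "
            f"random {total_minutes(RANDOM, items):.2f} min"
        )
    print(f"break-even near {break_even_items():.0f} items per load")
```

Under these assumptions the faster per-item search of the random version
pays for its loading time only when several hundred items are handled
between loads.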

One additional comment should be made here regarding the above
algorithm. If one were to combine 91 records into one read/write statement
so that logical record lengths were increased to the blocksize used
in sequential searches, one might overcome the limitation of no JCL blocking
in random access. This would of course impose at least two constraints.
(1) All records would have to be read/written under the same format, namely
alphanumeric, and as such only logical (not arithmetic) comparisons would
be used. The implication of this is discussed later (cf. p. 20).
(2) One would have to hold in core addresses to locate the appropriate
1260 byte record as well as the part of that record which was wanted for
retrieval. Therefore, while the cost of I/O time may be reduced, the cost
of core storage would be increased. The issue of core storage provides
another basis for comparing sequential and random searches in the Medsirch
system and will now be discussed.

Medsirch - 3 (sequential) and Medsirch - 4 (random) required 96K
and 188K bytes of core storage respectively. These figures reflect the
fact that additional space is needed for dictionaries and addresses when
random direct access is used. Furthermore, as the item pool increases,
core requirements for Medsirch - 4 go up by a ratio of 1K for each nine
additional multiple choice questions, while core requirements for
Medsirch - 3 remain relatively unchanged.

The differences in execution times and core requirements of sequential
and random access indirectly determine the usability of these two files.
The cited core requirements for Medsirch - 4 (random search) are based on
a pool of 648 multiple choice questions; if the pool were twice as large
(1296 items) core requirements would be 254K. It is obvious that as the
item pool increases one might have to reduce the choice of search terms;
for example, instead of using all of the 57 variables in Appendix A, one
might use only 26 variables for each batch of requests.

On the other hand, not only are the core requirements of Medsirch - 3
relatively unaffected by the size of the pool, but they are also relatively
unaffected by the number of search terms in Appendix A. Medsirch - 3 is,
however, restrictive in the number of terms one may use simultaneously.
This is due to the fact that items which do not meet all, but do meet some,
search terms may also be considered as relevant by the user. Such items
in Medsirch - 3 are written onto additional temporary data sets, and may
be retrieved later if the main pool does not provide enough items meeting
all search terms. Thus if a large number (e.g., "X") of simultaneous search
terms were used, it would also be necessary to use "X" additional data
sets in a generalized program. I/O time was found to increase
significantly with each additional simultaneous search term, and partially
accounted for the fact that Medsirch - 3 I/O time was not always
significantly less than Medsirch - 4 when large portions of the pool were
retrieved.

Salton has pointed out the applicability of sequential files
(cf. p. 3). More specifically, this author suggests that if one's data
lends itself to deep indexing, but within a restricted range of choices
for search terms, random access seems to offer the greatest flexibility.
On the other hand, if one's data requires a very broad choice of search
terms and can be searched with shallow indexing, sequential searches seem
to be a more viable alternative than random searches. However, one must
also consider the average proportion of the total pool being retrieved, as
well as the feasible amount of core storage, before deciding which file
(sequential or random) is most suitable to his particular needs.

Since shallow indexing with a broad choice of search terms is
suitable to the needs of the R. S. McLaughlin Centre, and because the
average retrieval time per item for Medsirch - 4 is not significantly
better than Medsirch - 3, this author must concur with Senko's (1969, p. 121)
statement that "the applicability and desirability of random access . . .
become [sic] extremely restrictive." Table 2 summarizes the differences
between random and sequential files as found in the Medsirch system. What
is now necessary is an investigation to determine where in this continuum
of usability list files are to be placed.

TABLE 2

DIFFERENCES BETWEEN SEQUENTIAL AND RANDOM FILES 
(as found in the Medsirch System) 



A. Random:

   1. Core requirements are greater than sequential.
   2. No JCL provision for blocking:
      (a) I/O time thus is increased;
      (b) if blocking done by programming:
          (i) only logical comparisons possible,
          (ii) core requirements are increased further.
   3. As the number of records in the bank increases:
      (a) core requirements increase,
      (b) execution time remains relatively constant.
   4. Suitable for deep indexing.
   5. Not suitable to a large choice of search terms.
   6. Permanent disc space required:
      - loading time is 40 times greater than sequential.
   7. Updating:
      (a) if records are deleted or replaced execution time is efficient;
      (b) if records are inserted as additions efficiency is poor.
   8. Must use only fixed length records.
   9. Adequate maintenance of file is more involved.
  10. Not suited to retrieving large portions of bank.

B. Sequential:

   1. Core requirements are less than random.
   2. JCL blocking is available:
      - I/O time for maximally blocked 80-byte records is approximately
        5% of execution time.
   3. As the number of records in the bank increases:
      (a) core requirements remain relatively constant;
      (b) execution time is increased.
   4. Suited to shallow indexing.
   5. Allows a great variety of searchable terms.
   6. Only temporary disc space is needed.
   7. Updating:
      (a) requires rewriting entire data set;
      (b) no particular difference between deletions, changes or additions.
   8. Fixed or variable length records can be used.
   9. Maintenance of file is minimal.
  10. Not suited to retrieving small number of records from bank.



CHAPTER THREE



MEDSIRCH STRATEGY 



In order to search each item (I) in the pool it was categorized as
$I_{V_{1,k};\, V_{2,k};\, \ldots;\, V_{57,k}}$, where $V_j$ ($j = 1, \ldots, 57$)
are variables identifying such item parameters as area of subspecialty,
type of question, taxonomic level, etc. Each variable ($V_j$) has its own
subdivisions ($k$); that is, each variable has certain values. For example,
the variable for area of subspecialty may take values of k = 1, 2, ..., 23,
where the values of k stand for allergy, cardiovascular, ..., physiology
respectively. The reader should refer to Appendix A for a list of all
variables and the values each variable may take. This thesaurus contains
all search terms (i.e., search restrictions) available in the Medsirch
system.

The basic strategy for retrieval in this system is flowcharted in 
Figure 1. The reader may wish to consult this chart as the following 
explanation is given. 

The Medsirch strategy makes provision for retrieving items on the
basis of prior knowledge of the item bank and also on the basis of no prior
knowledge. If the user knows exactly which items he wants he may retrieve
them by providing a list of item identification numbers (Figure 1: C, N,
O). If the user does not know exactly which items he wants, but does
know the characteristics of such items, he must then submit a request
specifying what search restrictions ($V_{j,k}$'s) items should or should
not meet. Such a request is called a profile.
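
To make the notions of item, variable, and profile concrete, a minimal
sketch follows. The class and field names, the variable numbers, and the
sample values are all hypothetical; the actual codes are those of the
thesaurus in Appendix A, and the actual request format is that of
Appendix B. The sketch also assumes one value per variable per item.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One multiple choice item: variable number j -> coded value k."""
    item_id: int
    values: dict[int, int]


@dataclass
class Profile:
    """A user request; all three field names are illustrative."""
    should_meet: list[tuple[int, int]] = field(default_factory=list)
    should_not_meet: list[tuple[int, int]] = field(default_factory=list)
    item_ids: list[int] = field(default_factory=list)  # retrieval by ID


def meets(item: Item, restriction: tuple[int, int]) -> bool:
    """True if the item carries value k on variable j."""
    j, k = restriction
    return item.values.get(j) == k


# Hypothetical codes: the item has value 2 on variable 1 and 1 on variable 3.
item = Item(item_id=1001, values={1: 2, 3: 1})
profile = Profile(should_meet=[(1, 2), (3, 1)], should_not_meet=[(2, 4)])

met = [r for r in profile.should_meet if meets(item, r)]
excluded = any(meets(item, r) for r in profile.should_not_meet)
print(f"restrictions met: {len(met)}, excluded by a 'not' term: {excluded}")
```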
   A. (1)  Read user request and rewind data set.
   B. (2)  Read new item from item bank.
   C.      Are items retrieved by ID?  Yes: go to 5.  No: continue.
   D.      End of bank?  Yes: go to 1.  No: continue.
   E.      Item meets "not" restrictions?  Yes: go to 2.  No: continue.
   F.      Counter = total number of restrictions.
   G. (3)  Item meets counter restrictions?  Yes: go to 4.  No: continue.
   H.      Does threshold weight allow retrieval for less restrictions?
           No: go to 2.  Yes: continue.
   I.      Is the number of items meeting more restrictions sufficient
           for user?  Yes: go to 2.  No: continue.
   J.      Counter = counter - 1.  Go to 3.
   K. - O. (continuation of chart; box text missing from this copy)

Figure 1. Strategy for Medsirch
For each profile submitted by the user a search is made of the
entire bank (Figure 1: A, B, D). An item which is immediately ignored
may have one or more of the following characteristics. (1) It may
possess some "not" characteristics (that is, an item may have a
characteristic which the user does not want); see Figure 1: E. (2) An
item may meet no search restrictions (that is, it does not match any
terms ($V_{j,k}$'s) in the user's profile); see Figure 1: F - J. (3) The
number of search restrictions it does meet may be below the threshold
weight, where threshold weight is defined as the number of search
restrictions that must be met by an item in order for it to be retrieved;
see Figure 1: H.

The remaining items are considered potential retrievals; how many of
them are actually retrieved is decided by the interaction of the user's
request, the number of documents (i.e., number of multiple choice items)
wanted, and the number available for retrieval.

Basic to most retrieval designs is an iterative feature for
approximating the user's need if the nature of the bank dictates that the
complete request of the user cannot be fulfilled. In the Medsirch system
this is accomplished by first retrieving items which meet all restrictions.
If this constitutes an insufficient number of retrieved documents, items
meeting one less restriction are also selected. If the total number of
items selected to this point is still not enough, those documents
meeting two less restrictions are retrieved, and so on, until enough
items are retrieved or until the threshold weight is reached; see Figure
1: F-M, O.
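
The iterative feature described above can be sketched as follows. This is
not the Medsirch program, which scans the bank record by record; the sketch
assumes the whole bank is held in memory as a dictionary, counts matched
restrictions regardless of their order, and retrieves every item found at
each iteration. The function names and the sample codes are hypothetical.

```python
def count_met(item: dict[int, int], wanted: list[tuple[int, int]]) -> int:
    """Number of wanted (variable, value) restrictions the item meets."""
    return sum(1 for j, k in wanted if item.get(j) == k)


def retrieve(
    bank: dict[int, dict[int, int]],
    wanted: list[tuple[int, int]],
    unwanted: list[tuple[int, int]],
    n_wanted: int,
    threshold: int,
) -> list[int]:
    """Relax the request one restriction at a time until enough items are
    found or the threshold weight is reached."""
    # Items carrying any "not" characteristic are ignored outright.
    candidates = {
        item_id: count_met(item, wanted)
        for item_id, item in bank.items()
        if not any(item.get(j) == k for j, k in unwanted)
    }
    selected: list[int] = []
    counter = len(wanted)                 # start with all restrictions
    while counter >= max(threshold, 1):   # items meeting none never qualify
        selected += sorted(i for i, met in candidates.items() if met == counter)
        if len(selected) >= n_wanted:
            break
        counter -= 1                      # accept one less restriction
    return selected


bank = {
    1: {1: 2, 2: 1, 3: 3},        # meets all three restrictions
    2: {1: 2, 2: 1, 3: 1},        # meets two
    3: {1: 2, 2: 4, 3: 1},        # meets one
    4: {1: 2, 2: 1, 3: 3, 4: 9},  # meets three but has a "not" value
    5: {1: 5, 2: 4, 3: 1},        # meets none
}
wanted = [(1, 2), (2, 1), (3, 3)]
unwanted = [(4, 9)]
print(retrieve(bank, wanted, unwanted, n_wanted=2, threshold=2))
```

Raising `threshold` to 3 in this example would stop the search after the
first iteration, so only the item meeting all three restrictions is returned.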

If the search must iterate to select items which do not meet all
restrictions, the user may specify which search restrictions he considers
most important. With this information the computer can select items which
do not meet all restrictions, but do meet the most important restrictions.
In this case, to minimize the amount of effort required by the user in
preparing his profile, one of the following user's needs is assumed to
exist. (1) The user considers that the order in which he has specified his
search terms is important. Hence, if items are to be retrieved that meet
less than the total number of restrictions (for example, four restrictions),
then items meeting the first three restrictions are retrieved next, then,
if necessary, the first two restrictions, etc. (2) The user considers that
the order in which he has specified the restrictions is unimportant. In
this case an iterative search would take items meeting any three
restrictions, then any two restrictions, etc. (3) The user wants to
preserve the order of his restrictions only up to a certain point, for
example, the threshold weight. In this case iterative searches would take
any combination of restrictions after the first 'x' number of restrictions
had been met. In preparing his profile the user is only required to
indicate which one of these three conditions is most suitable to himself.

In general the number of items obtained at any given iteration
would be least in case (1) and greatest in case (2), with case
(3) providing a number somewhere between these two extremes. Of course,
the more items obtained at each iteration, the less likely it would
be that any further iterations were necessary.
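
The three cases can be compared with a small sketch. It is one possible
reading of the text, not the Medsirch code: each item is reduced to a list
of true/false flags, one per restriction in the order the user gave them,
and `qualifies`, `mode`, and `x` are illustrative names.

```python
def qualifies(flags: list[bool], counter: int, mode: int, x: int = 0) -> bool:
    """Does an item qualify when `counter` restrictions are required?

    mode 1: order important, the first `counter` restrictions must be met.
    mode 2: order unimportant, any `counter` restrictions may be met.
    mode 3: the first `x` restrictions must be met, the rest in any order.
    """
    if mode == 1:
        return all(flags[:counter])
    if mode == 2:
        return sum(flags) >= counter
    if mode == 3:
        return all(flags[:x]) and sum(flags) >= counter
    raise ValueError("mode must be 1, 2, or 3")


# Four restrictions; the search has relaxed to three (counter = 3).
items = {
    "a": [True, True, True, False],
    "b": [True, True, False, True],
    "c": [False, True, True, True],
    "d": [True, False, False, True],
}
for mode, x in ((1, 0), (2, 0), (3, 1)):
    hits = [name for name, flags in items.items() if qualifies(flags, 3, mode, x)]
    print(f"case ({mode}): {hits}")
```

In this example case (1) admits one item, case (3) two, and case (2) three,
which matches the ordering described in the text.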

Finally, the user has the option of asking for a random selection
of items if the opportunity presents itself; if he does not avail
himself of this feature all items, at any given iteration, will be
retrieved; see Figure 1: L, M, O. For example, assume the user wanted
10 items meeting four restrictions, and that the bank had 20 such items;
the user could retrieve all 20 items or 10 randomly selected from the
20 items available. If searches proceed to less restrictive items the
random feature still works. For example, assume the same conditions as
before but that only 8 items were available meeting the four restrictions,
with 12 additional items meeting just three restrictions. In this case
8 items meeting the four restrictions would be retrieved first; the
user could then retrieve the next 12 items or obtain two randomly
selected from the 12 in order to get the 10 items he wanted. Note,
however, that if the threshold weight had indicated that only items
meeting four restrictions were wanted, then randomly selecting the two
items, or retrieving all 12, would have been impossible.

To prepare a request the user must use the parameter values 
specified in the Medsirch thesaurus (Appendix A) and follow the format 
specifications as given in the Medsirch documentation (Appendix B). The
latter Appendix also provides profiling examples. 







CHAPTER FOUR 



EVALUATION OF MEDSIRCH STRATEGY 

The reader may question the use of numbers instead of using the
actual words (see Appendix A) for coding and searching. The question
is a valid one since there is reason to believe the user may feel more
comfortable using the verbalization of his mother tongue rather than an
abstract numbering system. However, this author purposely avoided
the use of words for the following reasons. (1) Word searches usually
involve some form of truncation, which necessarily reduces the
readability of output. Medsirch output is directly usable with full text,
proper spacing, and complete verbalization of the descriptors.
(2) While truncation is not imperative with word searches, the problems
of added storage, user misspellings, and excessive keypunching for both
storing and searching become more prominent. (3) Logical comparisons
are necessary in word searches. In terms of the computer this is less
efficient than arithmetic comparisons, which are possible if numbers are
used for searches. (4) The use of word searches raises the question
as to why not search the text of a multiple choice item. It is this
author's opinion that multiple choice questions cannot, at the present
time, be searched in this manner. Lipetz (1966) points out that
"Satisfactory comparison . . . requires the ability to recognize the
important features in the word. This is not an easy task to turn over
to a machine [p. 177]." Abelson (1968, p. 419) agrees with this point
of view, emphasizing the need for human judgement in information retrieval.
He feels that professionals in individual fields of scientific research
are essential custodians of knowledge who cannot be replaced by archives
of any kind.

The reader may also question the lack of weighting facilities for 
each search restriction and the lack of opportunity for the user to 
express his own strategy with logical operators. This author has tried 
using some retrieval programs with these options and has encountered 
the frustrating experience of either obtaining too few relevant articles 
or so many retrievals that it was impossible to use them meaningfully. 
In some cases one had to resubmit his profile in order to get what he 
knew were available articles but had, in previous requests, been unable 
to find. While the Medsirch system may not eliminate all such frustration, 
it does not require the user to laboriously devise his own 
weighting and logical scheme. Most, if not all, advantages of allowing 
the user to specify his weights and logic are accomplished in the 
Medsirch system by simply specifying three numbers, one each for the 
number of items wanted, the threshold weight, and the importance of the 
order of the restrictions. In essence the weighting system and logic 
scheme are turned over to the computer. 
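
The three-number profile can be pictured with a minimal sketch. It is not the Medsirch algorithm, which this chapter does not spell out. The `Restriction` class, the linear rule in `restriction_weights`, the descriptor names, and the item numbers are all assumptions made for illustration. The sketch assumes that earlier restrictions count for more as the order-importance number grows, that an item scores the summed weight of the restrictions it satisfies, and that only items reaching the threshold are returned, up to the number wanted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Restriction:
    """One search restriction (hypothetical structure, not Medsirch's format)."""
    variable: str        # coded variable, e.g. a thesaurus entry
    value: int           # coded value the variable is compared against
    wanted: bool = True  # True: should be met; False: should not be met


def restriction_weights(n, order_importance):
    """Assumed rule: the first restriction weighs most when order matters.

    With order_importance == 0 every restriction weighs 1.
    """
    return [1 + order_importance * (n - 1 - i) for i in range(n)]


def retrieve(items, restrictions, n_wanted, threshold, order_importance):
    """Return up to n_wanted item numbers whose score reaches the threshold."""
    weights = restriction_weights(len(restrictions), order_importance)
    scored = []
    for item_id, descriptors in items.items():
        score = 0
        for restriction, weight in zip(restrictions, weights):
            met = descriptors.get(restriction.variable) == restriction.value
            if met == restriction.wanted:
                score += weight
        if score >= threshold:
            scored.append((score, item_id))
    # Highest score first; ties broken by item number for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:n_wanted]]


# Hypothetical item pool: item number -> coded descriptors.
ITEMS = {
    101: {"taxonomy": 1, "difficulty": 2, "answer_type": 1},
    102: {"taxonomy": 3, "difficulty": 2, "answer_type": 1},
    103: {"taxonomy": 3, "difficulty": 4, "answer_type": 2},
}

# Hypothetical profile: two restrictions to meet, one to avoid.
PROFILE = [
    Restriction("taxonomy", 3),
    Restriction("difficulty", 2),
    Restriction("answer_type", 2, wanted=False),
]

if __name__ == "__main__":
    print(retrieve(ITEMS, PROFILE, n_wanted=2, threshold=3, order_importance=1))
```

Under these assumptions the user supplies only the ordered restrictions and the three numbers; the weights and the combining logic are derived by the program rather than written by the user.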

However, the Medsirch system is still hampered by many of the 
problems in other retrieval systems. 

(1) The user is still required to learn the system’s profiling 
technique before he can maximize its usefulness. 
(2) The system is not generalizable to any retrieval of 
information (i.e., Appendix A is a limited thesaurus). 

(3) The computer has not been utilized to its fullest 
advantage for automatic retrievals. 

(4) At present the Medsirch strategy is linear in nature; 
if the user is defining new items as relevant or non- 
relevant on the basis of what he has already received, 
he may in fact be redefining relevancy throughout 
retrieval. The Medsirch system cannot adapt to this 
peculiar interaction between the user and the pool of 
potentially relevant items. One must learn more about 
the characteristics of each user before there can be 
less need for the user to do his own profiling. 







CHAPTER FIVE 



COST OF IMPLEMENTING MEDSIRCH SYSTEM 

Before one is able to use the Medsirch system he must of course 
prepare his item bank. Each multiple choice item is punched onto cards 
along with two cards holding its descriptors (cf. Appendix A); these 
descriptors or indexes must be punched according to rigid format 
specifications. A Fortran program (CHECK) is available for checking 
the keypunching; other programs are available for stacking cards onto 
tape (UTILITY), sequentially revising the item pool (UPDATE), dumping 
the item pool (BANDUM), counting the number of records and items in 
the pool as well as dumping the pool of indexes (COUNT), and creating 
a tape for holding the addresses and descriptors of all items to be 
searched randomly (DICT). While not all of these additional programs are 
essential, they do facilitate the maintenance of the item pool,¹ 
which, if properly done, allows Medsirch-3 or Medsirch-4 to achieve 
more efficient and/or adequate retrievals. 

Table 3 provides a list of the costs in implementing the 
Medsirch system, including human requirements (that is, typing, coding, 
keypunching, revising, selecting relevant items) and machine requirements 
(that is, tapes, discs, core, execution time). Cost is not given 
in terms of monetary values since the financial cost of human requirements 
as well as computer time and core space is relative to one's institution. 
Figures are also included for modified hardware requirements; the reader 
is cautioned that any of the suggested modifications may reduce efficiency 
and/or user satisfaction. 

¹ Any new user should not underestimate the importance of maintenance 
of any pool of data. It is suggested that a specific timetable 
be established in developing the pool, maintaining it, and retrieving 
data. 
TABLE 3 

ESTIMATED COST OF IMPLEMENTING MEDSIRCH SYSTEM 

Average Minutes per Multiple Choice Item 

Man 
  Typing ............................................ 7.0 min. 
  Coding 
    New item ........................................ 3.0 
    Revised item .................................... 6.0 
  Keypunching ....................................... 7.0 
  Selecting relevant item after retrieval ........... 5.0 
  Reviewer's checking content, spelling, etc. ....... 5.0 

Computer 
  Program         Input/Output       Total Execution 
  Check           0.00025 min.       0.005 min. 
  Utility         0.00023            0.00032 
  Update          0.0017             0.034 
  Bandum          0.00025            0.005 
  Count           0.000028           0.00046 
  Dict            0.0000082          0.00082 
  Medsirch-3      see Table 1, p. 8 
  Medsirch-4      see Table 1, p. 8 
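
The per-item figures in Table 3 can be scaled to a whole item bank. The sketch below is only a worked illustration: which human steps apply to a new item, which programs are run once over the pool, and the names `preparation_hours` and `machine_minutes` are assumptions, and the Utility execution figure is taken to be 0.00032 min.

```python
# Average minutes per multiple choice item, copied from Table 3.
HUMAN_MINUTES = {
    "typing": 7.0,
    "coding_new": 3.0,
    "coding_revised": 6.0,
    "keypunching": 7.0,
    "selecting": 5.0,
    "reviewing": 5.0,
}

# Total execution minutes per item for the supporting programs (Table 3).
# The Utility value is assumed to read 0.00032.
EXECUTION_MINUTES = {
    "CHECK": 0.005,
    "UTILITY": 0.00032,
    "UPDATE": 0.034,
    "BANDUM": 0.005,
    "COUNT": 0.00046,
    "DICT": 0.00082,
}

# Assumption: a new item is typed, coded as new, keypunched, and reviewed once.
NEW_ITEM_STEPS = ("typing", "coding_new", "keypunching", "reviewing")


def preparation_hours(n_items, steps=NEW_ITEM_STEPS):
    """Human hours needed to prepare n_items new items."""
    minutes_per_item = sum(HUMAN_MINUTES[step] for step in steps)
    return n_items * minutes_per_item / 60.0


def machine_minutes(n_items, programs):
    """Execution minutes for one pass of each named program over n_items."""
    return n_items * sum(EXECUTION_MINUTES[name] for name in programs)


if __name__ == "__main__":
    n = 1000
    print(f"Human preparation: {preparation_hours(n):.1f} hours")
    print(f"CHECK + UTILITY:   {machine_minutes(n, ['CHECK', 'UTILITY']):.2f} min")
```

For a bank of 1,000 new items these assumptions give about 367 hours of human preparation against a few minutes of machine time, which suggests that human effort dominates the cost of building the pool.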
TABLE 3 (continued) 

ESTIMATED COST OF IMPLEMENTING MEDSIRCH SYSTEM 

Hardware Requirements¹ Without Modification 

                  Amount of Core (Bytes) 
Program       Execution   Blocking      Total   Data Sets Required² 
                          (2 buffers) 
Check            4k                       4k 
Utility         43k         15k          58k    1 tape 
Update          54k         30k          74k    2 tapes 
Bandum          45k         30k          75k    1 tape, disc space for 1 temporary data set 
Count           21k         15k          36k    1 tape 
Dict            32k         30k          62k    2 tapes 
Medsirch-3      24k         72k          96k    1 tape, disc space for 3 temporary data sets 
Medsirch-4     173k         15k         188k    1 tape, disc space for 1 permanent data set 

¹ Total Requirements: 96k, 2 different tapes, and disc space for five temporary 
data sets if using sequential file; or 188k, 3 different tapes, permanent disc 
space for 1 data set, and temporary disc space for 1 data set if using random file. 

² Data sets required in addition to card reader, card puncher, and printer. 
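
The core totals quoted for the sequential and random versions can be checked against the per-program totals in Table 3. The sketch below assumes that the programs run one at a time, so the core needed is the largest single total. The grouping of programs into a sequential set and a random set is likewise an assumption based on the program descriptions, and `peak_core_k` is a hypothetical helper name.

```python
# Total core per program in k bytes, copied from Table 3 (continued).
TOTAL_CORE_K = {
    "CHECK": 4,
    "UTILITY": 58,
    "UPDATE": 74,
    "BANDUM": 75,
    "COUNT": 36,
    "DICT": 62,
    "MEDSIRCH-3": 96,
    "MEDSIRCH-4": 188,
}

# Assumption: these maintenance programs are used with either file type.
SHARED = ["CHECK", "UTILITY", "UPDATE", "BANDUM", "COUNT"]

# Assumption: Medsirch-3 searches the sequential file; Medsirch-4 searches
# the random file and also needs the tape built by DICT.
SEQUENTIAL = SHARED + ["MEDSIRCH-3"]
RANDOM = SHARED + ["DICT", "MEDSIRCH-4"]


def peak_core_k(programs):
    """Largest core total among programs that are run one at a time."""
    return max(TOTAL_CORE_K[name] for name in programs)


if __name__ == "__main__":
    print("Sequential file:", peak_core_k(SEQUENTIAL), "k")
    print("Random file:    ", peak_core_k(RANDOM), "k")
```

Under these assumptions the search program itself sets the core requirement in both cases, which agrees with the 96k and 188k totals given in the table note.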